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Abstract 

We describe a high performance parallel implementation of a derivative pricing model, within which we introduce 
a new parallel method for the calibration of the industry standard SABR (stochastic-a/Jp) stochastic volatility model 
using three strike inputs. SABR calibration involves a non-linear three dimensional minimisation and parallelisation 
is achieved by incorporating several assumptions unique to the SABR class of models. Our calibration method is 
based on principles of surface intersection, guarantees convergence to a unique solution and operates by iteratively 
refining a two dimensional grid with local mesh refinement. As part of our pricing model we additionally present a fast 
parallel iterative algorithm for the creation of dynamically sized cumulative probability lookup tables that are able to 
cap maximum estimated linear interpolation error We optimise performance for probability distributions that exhibit 
clustering of linear interpolation error. We also make an empirical assessment of error propagation through our pricing 
model as a result of changes in accuracy parameters within the pricing model's multiple algorithmic steps. Algorithms 
are implemented on a GPU (graphics processing unit) using Nvidia's Fermi architecture. The pricing model targets 
the evaluation of spread options using copula methods, however the presented algorithms can be applied to a wider 
class of financial instruments. 



1. Introduction 

The pricing of financial derivatives is a computationally 
demanding task and for many users of financial deriva- 
tives such pricing is undertaken by large scale comput- 
ing farms. The use of highly parallel co-processors such 
as GPUs (graphics processing units), as opposed to the 
sole use of CPUs (central processing uruts), is thought 
to offer computational speed and cost improvements off- 
set against the complexity of deriving algorithms able to 
substantially exploit highly parallel architectures. Com- 
putational speed and cost improvements are based on 
the observation that highly parallel architectures such 
as GPUs typically exhibit superior levels of peak mem- 
ory and arithmetic throughput rates compared to CPU 
architectures. 

Within the context of financial derivatives pricing, a di- 
verse set of computational techniques exist, with tra- 
ditional methods constituting the following: analytic 
methods which calculate closed-form solutions of par- 



tial differential equations (PDEs) (Black and Scholes 
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or probabilistic expectations ( Cox and Ross 
, finite dijference methods ( |SchwaJtz| 1977 1 which 
approximate solutions to PDEs using difference equa- 
tions, Monte Carlo methods ( |Boyle| |1977| l which at- 
tempt to simulate stochastic processes and finally, tree 
based methods (Cox e t aT) |1979| l which create lattice 
structures for stochastic processes. Aside from ana- 
lytic methods, which do not require significant com- 
putational effort, several works have demonstrated how 
the inherent parallelism exhibited by such computa- 
tional methods enables effective use of highly parallel 
co-processors such as GPUs. In particular, Monte Carlo 
methods, due to their often significant levels of inherent 
parallelism, have shown particular performance gains 
over CPU based alternatives ( |Joshi[ |2010| 'Bennem ann 



et al. 2008 Dixon etal. 2012 1. Modern pricing meth- 



ods for financial derivatives are known to rely on addi- 
tional computational techniques, such as iterative min- 
imisation algorithms, which are often computationally 
expensive ( |Brigo and Mercuriol|2006| l. As such, within 
this paper we formulate high performance parallel algo- 
rithms for the acceleration of a derivative pricing model 
that is composed of both Monte Carlo and iterative al- 
gorithms, where we focus solely on iterative algorithms 



operating in the presence of variable accuracy parame- 
ters. 

The first type of iterative algorithm we consider is the 
calibration of the SABR (stochastic-a^Sp) model ( |Ha-| 
|gan et aL| |2002) l. We present a new parallel method 
which is designed to be implemented on highly parallel 
architectures such as GPUs. SABR calibration is used 
to ensure that the output from the SABR model agrees 
with market observations of implied volatility, this es- 
sentially corresponds to a minimisation problem. SABR 
models are common within derivative pricing and al- 
though it is typical for financial market participants to 
use customised forms of the SABR model, the method 
we present is applicable to all SABR models consistent 
with a set of generic properties we highlight within the 
paper. Typically the SABR model calibration is through 
minimisation algorithms optimised and designed for 
single thread execution, as such known methods to cali- 
brate SABR models ( f^stl |2005j pariE] [2011] [Nilsson 



C"Mkt 



2008|l incl ude off-the-shelf Simplex methods ('Nelder' 
and Mead|| 1965[ ), Levenberg-Marquardt methods (Mar-, 



quardt 1963| l and gradient descent methods (Press et al. 



2007^ Our new parallel SABR calibration method op 



erates by iteratively refining a two dimensional grid 
until a unique set of SABR parameters is found that 
match SABR generated option volatilities to a standard 
input of three market observed option implied volatil- 
ities. The method is based on principles of surface 
intersection and is shown to offer guaranteed conver- 
gence. 

The second type of iterative algorithm we consider is the 
creation of lookup tables used to store cumulative prob- 
ability distributions. We present GPU based parallel al- 
gorithms that create dynamically sized lookup tables, 
are able to cap maximum estimated linear interpolation 
error and are optimised for probability distributions of 
the type implied by the SABR model which exhibit lin- 
ear interpolation error clustering. 

Our two stated iterative algorithms are combined within 
a derivative pricing model, for which we empirically 
highlight error propagation and performance results. 
We note that in addition to the possibility of using paral- 
lel algorithms for a single SABR calibration or the cre- 
ation of a single lookup table, further parallelism is ex- 
hibited by our pricing model as it contains multiple in- 
dependent coupons, each of which require the same cal- 
culations (Nasar-Ullah||2012j . Results are generated us- 
ing the system properties shown in Appendix A 




K 



Figure 1: An example of the market implied volatility 
o"Mkt of three European options exhibiting a volatility 
smile. All three options contracts are identical except 
that they have different strike prices K. 



1.1. Structure of paper 

In section |2] we introduce a new parallel method for 
SABR calibration. In section [3] we introduce parallel 
algorithms for the creation of cumulative probability 
lookup tables. In section |4] we turn to our derivative 
pricing model as a whole (comprising of both SABR 
calibration and the creation of cumulative probability 
lookup tables), where we first empirically highlight er- 
ror propagation through the pricing model as a result 
of changing accuracy parameters and secondly present 
overall timing results. 



2. SABR model calibration 

2.1. Background 

Options are financial contracts where at an initial time 
f = the buyer of the option acquires the right but not 
the obligation to fulfil a given transaction at a later time. 
The simplest type of option, the European call option, 
confers upon the holder the right to buy an underlying 
asset F at a future time T for a fixed strike price K. The 
payoff (i.e. the final value of the option at time T) can 
be expressed as follows, where F{T) is the final asset 
price; 



Call payoff = Max(F(r) - K, 0). 



(1) 



The seminal Black Scholes formula ( |Black and Scholesl 
1973^ is used to give the initial price V of such op- 
tions at f = 0. The Black Scholes formula is a function 
V{F{Q), K, T, r, cr), where additionally F{Q) is the initial 
asset price, r is a risk free rate and cr is a measure of 
asset volatility. 

The market implied volatility o-Mkt is defined as the 
volatility such that when input into the Black Scholes 
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Minimum eiTor 
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Figure 2: An example of a 2 x 2 grid refining over a 
two dimensional space. The grid iterates from coarse 
spacing to fine spacing in order to identify the precise 
location of the global minimum error The global mini- 
mum error is located in the filled region. 

formula, alongside the parameters F(Q), K, T, r of an 
observed market option with price VMkt. the result- 
ing Black Scholes price matches Vivikt, that is VMkt = 
y(F(0), K, T, r, (TMkt)- Empirical evidence for several 
markets suggests the presence of a 'volatility smile' 
( |Hagan et al.||2002 |, a phenomenon where identical op- 
tion contracts except for different strikes K have differ- 
ent implied volatilities as shown in figure [T] The Black 
Scholes model however assumes a constant volatility 
when calculating the price of such options, in doing so a 
single set of Black Scholes parameters cannot quantify 
a volatility smile. 

With a view to capturing volatility smile dynamics, the 
SABR stochastic volatility model (Hagan et al. 2002| l 
is a popular choice which (unlike the Black Scholes 
model) is able to generate strike dependent volatilities 
o"SABR- Several SABR parameters must be estimated 
to fully specify a SABR model. The estimation of 
SABR parameters is typically conducted by a three di- 
mensional minimisation, referred to as calibration, that 
ensures strike dependent SABR volatilities itsabr repli- 
cate actual market implied volatilities o-Mkt for a number 
of market observed options. 



The general SABR model can be given as: 

dF(t) = s(t)F(tfdW(t), 
ds(t) = as(t)dZ(t), 
dWit)dZ(t) = pdt. 



(2) 



where F(t) is an asset price process, s(t) is a volatil- 
ity process and dW{t), dZ{t) are two stochastic Brown- 
ian motion processes. Additionally, the SABR model 
has five parameters: /3 is an asset price exponent term. 
Of is a term representing the volatility of the volatility 



process s(f), p represents a correlation between dW(t) 
dZ(t), F(Q) is an initial asset price which is directly ob- 
servable and s{Q) is an initial volatility value. 

Since the value of F(Q) is directly observable, there are 
four unknown remaining parameters which require es- 
timation. In our implementation, SABR parameters at- 
tempt to replicate an observed market volatility smile 
based on a standard approach of observing the im- 
plied volatilities of three market options with strikes 
K-,Kat:m and K+ (see figure [T]), where Katm refers to 
the option's strike price being equal to the initial price 
of the underlying asset (i.e. K - F(Q)) which is known 
as an ATM (at-the-money) option, and where K < 
Katm < K+. Corresponding market implied volatili- 
ties are given as crMkt(^^-), o"Mkt(^ATM) and crMkt(^^+)- 
Further, we use the approach of arbitrarily selecting a 
constant value for the parameter /3 (where < /3 < 1) 
Pagan et al] [2002] \Wesi\ [2005] l. Consequently we re- 
quire calibration of the three remaining parameters: a, 
p and s(0). It is further noted that the value of 5(0) can 
be calibrated directly from fixed values of a and p based 



on crMkt(^ATM) ( |Hagan et ar.|(2002, West||2005 1. 



The minimisation or cahbration objective can be stated 
formally as follows, where SABR volatility csabr is 
represented by a function of the previously stated vary- 
ing parameters as crsABRi^i,o:,p, s(Q)), where K e 
Katm,K+}, i - 3 and where calibrated values are 
Hsted as a*,p*, i(0)*: 



{a',p\s(Oy] = 

3 

argmin ^ |crsABR(K/, a,p, i(0)) - o-Mkt(K;)| . (3) 

a,p.s(fi) 



Using a three strike calibration approach, the calibrated 
values {a*,p* , s(0)*} uniquely minimise ([3]l such that the 
summation on the RHS (right hand side) of ([3]l becomes 
zero. 



2.2. A parallel method for SABR calibration 

Our proposed calibration method attempts to ascertain 
the location of the global minimum satisfying the min- 
imisation in ([3]), i.e. the values {cK*,p*, i(0)*), by it- 
eratively refining a two dimensional grid until a de- 
sired level of SABR calibration error is achieved, where 
SABR calibration error is calculated as the summation 
on the RHS of ([3]). An example of such iterative grid 
refinement can be seen in figure [2] An illustration of 
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(a) Front view (b) Side view 

Figure 3: Typical SABR calibration error as calculated by the summation on the RHS of ([sjl where the parameters 
a and p are varying. s{Q) has been previously calibrated based on (j4]) for each pair of a and p. Error is reported as 
- logjo error (e.g. 8.5 represents an error of 10 *'^) and the minimum error is located at the peaks. 



typical SABR calibration error based on changes to 
a and p is shown in figure [5] (the parameter s{Q) has 
been previously calibrated for each pair of a and p). 
Within our proposed calibration method each iteration 
of the refined grid contains the global minimum, how- 
ever (as seen in figure|3]l the SABR error surface is non- 
linear, hence establishing the presence of a global mini- 
mum within arbitrary points of a grid is non-trivial. To 
overcome this problem our method relies on exploiting 
known characteristics of the underlying SABR model to 
form a method based on principles of surface intersec- 
tion. 

We note that it is typical for separate market partici- 
pants to use mathematically varied implementations of 
the original SABR model shown in (j2]). As a result 
our proposed calibration method is designed to be ap- 
plicable to all SABR models consistent with a set of 
generic properties we highlight in the remainder of this 
section. 



Next we define error matrices M_ , Matm and M+ which 
are of size mxn, where m is equal to the size of 
an array a containing a range of a values indexed as 
{ai, ■ ■ ■ ,am], and n is equal to the size of an array p 
containing a range of p values indexed as {pi, ■ ■ ■ ,p„). 
The matrices M(.) thus cover a two dimensional domain 
of a and p with elements: 



M(.) = 



(■) (■) 



]y[«»,P2 



]y[Qr2,p„ 
(■) 



]y[«„,P„ 



(5) 



where each element M^j'' represents the error defined 



as: 



2.3. Preliminaries 

As stated the parameter s(0) can be calibrated from fixed 
values of the parameters a and p based on o-Mkt(^ATM)^ 
thus we define s{Q)*{a,p) as a value of ^(0) which 
is calibrated for a particular combination of {a,p) 
where: 



5(0)*<a,p) = 

arg min |crsABR(^^ATM, a,p, s{Qi)) - o-Mkt(^^ATM)l ■ (4) 



M°^l^ = o-sABR {K(.), a,p, s(Oy{a,p}) - crMkt {K( )) . 



(6) 



Due to the minimisation within Q, each element within 
the matrix Matm has a value of zero (or a value rep- 
resenting the accuracy to which 5(0) was minimised 
within Q). An example of the surfaces of M_ and M+ 
can be seen in figures 4a and 4b 



In the two dimensional domain of a and p (within our 
error matrices M_,M+) we assume there is a point C 
that represents the unique combination (a*,p*) where 
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p p p 

(a) Surface of M_ and the line A_ (b) Surface of M-,- and the Hne A+ (c) Top down view of A_ and A+ 



Figure 4; |(a)| and |(b)| are typical examples of the surfaces of M_ and M+. Intersection of the surfaces of M_ and M+ 
with the plane z = is shown by the solid lines defined as A_ and A+ respectively. The z axis represents an error 
value shown in (j6]) stored within each element of our error matrices M_ and M+. (c) shows a top-down view of the 



intersection between A_ and A+, where the left hand side of M_ is positive upto A_ and the right hand side of M+ is 
positive upto A+. 



both M!*'''* = and M'^* ''* = 0. Thus, the point C rep- 
resents the solution which uniquely satisfies the minimi- 
sation in We note that based on ([3]l, a and p can be 
calibrated equivalently as follows, where siQ) is implic- 
itly calibrated due to (|6]|: 



{a,p*, siO)*] = arg min |M'!'''| + \M"/\. (7) 



where and are monotone. 

We note that (|9| results in a curvature or smile effect 
( Hagan et al.|,2002j which is further described in fig- 
Critically, we state that (|9]l holds only in the 



5b 



region of C, however as stated within our subsequent 
discussion we assume that we are always able to ob- 
serve the region of C based on our initial discretisation 
of a (within a). 



Our proposed calibration method is based on the follow- 
ing generic properties, which are typical of all SABR 
models: 



Property 1. 



5M_ 



<0, 









dp 


1 f5M- 1 




1 ''>P 1 


~ 1 1 



> 0, 



(8) 



where in the region of C: 

We note that ([8]) results in a steepness or skew effect 
(|Hagan et al.||2002 | which is further described in figure 
5a Furthermore, ([8|l implies that the surfaces of M_ and 
M+ intersect the zero plane at lines of intersection A_ 
and A+, as can be seen in figure|4] Due to the presence 
of a unique solution C, the lines A_ and A+ themselves 
intersect uniquely at C. 



Property 2. In the region of C: 

d (M_ + M+) 
da 



>0, 



(9) 



2.4. Algorithm descriptions 

Our proposed calibration method operates by iteratively 
refining a two dimensional grid (or mesh) of a and p 
values until the values a* ,p* are obtained, an example 
of which is shown figure |6] Each iteration involves the 
construction of the matrices M_ and M+, as shown in 
Q to (j6]l. Parallelism is achieved during the construc- 
tion of M_ and M+, whereby each element within such 
matrices is computed independently by multiple parallel 
threads. 

In order to refine the grid of a and p values for a sub- 
sequent iteration we require non-trivial information re- 
lating to the location of the solution C. This informa- 
tion is obtained through four principle steps, whereby 
the first step obtains lines of intersection A_ and A+. 
The second and third steps respectively obtain lower 
and upper bounds on a within a and lower and up- 
per bounds on p within p. The stated lower and upper 
bounds are designed to guarantee the bracketing of the 
solution C. Finally, the fourth step (which can be used 
optionally) involves estimating the precise location of 
the solution C. This enables the grid for a subsequent 
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Katm K+ 
(a) Varying p 




(b) Varying a in the region of C 



initial a 
at 
a 4- 



K 



Figure 5: (a) shows the SABR 'steepness' effect based on changes to p. For increasing/decreasing values of p, the 
gradient cr{K^) - o-{KJ) increases/decreases, this is achieved by decreasing/increasing cr{K_) and simuhaneously 



increasing/decreasing cr{K+). (b) shows the SABR 'curvature' effect based on changes to a in the region of C. For 
higher/lower values of a, the 'curvature' o-{K+) + o-{KJ) increases/decreases. 



iteration to incorporate finer meshing or local mesh re- 



finement around the estimate of C (see figures 6b to^ 



hence improving the rate of convergence to the solution 
C. We now present a detailed description of the stated 
four steps. 



2.4.1. Step 1: Obtaining lines of intersection A_ 
A+ (algorithm^ 



and 



The lines A_, A+ represent lines of intersection between 
the surfaces of M_,M+ and the zero plane, examples 
of which are shown in figure |4] Within our calibration 
method we choose A(.) to be a vector of size m, where m 
is equal to the size of a. Each value within a, referred 
to as a,, has a corresponding value A(.)(q',). The value 
A_(a,) represents a column index of p (within p) such 
that M!"'^-'"-' > and m!"'^-^"''^' < and the value 
A+(Qr,) represents a column index of p (within p) such 



thatM' 



> and M'; 



ci'i,A+(a/)-l 



< 0. 



We note that the lines A_, A+ represent intersections be- 
tween the surfaces of M_, M+ and the zero plane based 
on an index of p (as opposed to an index of a), this is 
due to the unidirectional or monotone behaviour of ^5-^ 

dp 

which ensures intersection with the zero plane always 
occurs for a value of p. Due to (|8]l we expect the pos- 
itive region of M_ to be for lower indices of p (within 
p), hence we initialise each value in A_ to represent the 
lowest index of p. Correspondingly we initialise each 
value in A+ to represent the highest index of p. A list- 
ing of this step is shown in algorithm[T] 



Algorithm 1: Calculate A , A+ 



1 Loop over all a values; 

2 for i = ai to a,„ do 

3 Initialise values; 

4 A_(0^pi; 

5 A+(0«-p„; 

6 Loop from low to high p values; 

7 for i = pi to p„ do 

8 if M^' > then 

9 [_ A_(0 ^ j; 

10 Loop from high to low p values; 

11 for i = p„ to pi do 

12 if M'/ > then 

13 I A+(0 <- j; 



2.4.2. Step 2: Calculating bounds for a (algorithm 

HI 

The second step brackets a* within lower and upper 
a bounds [a,, ay], where 0.^,0^ are a values within 
a. 

In the region of C, due to (j9]) and since M_ and M+ are 
independent it follows that: 



da da 



(10) 



As a result of ( [10) 1 three cases arise in the region of 
C: 



1. 



> and — — > 0. 



da 



da 



(11) 




(a) Iteration 1 




(d) Iteration 4 
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(b) Iteration 2 
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(e) Iteration 5 



(c) Iteration 3 
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(f) Iteration 6 



Figure 6: An example of the parallel SABR calibration method iterating to a solution. The y axis represents values 
of a and the x axis represents values of p. The colourbar value of each subgrid represents the error associated with 
the subgrid's top-left corner, where error is calculated as the RHS of (|7]). Error is reported as -logjg error (e.g. 8.5 
represents an error of lO"^'^). The clear white lines within each grid represent bounds used for the next iteration. 
Local mesh refinement (algorithm 5) is not enabled during the first iteration. 



Algorithm 2: Calculate a bounds ffj and a/ 

1 Initialise values; 

2 a, «— ai ; 

3 af «- a,,,; 

4 Loop from low to high a values; 

5 for ; = ui to a„, do 

6 if A_(0 < A+(!) - 1 tlien 

7 |_ ffi ;; 

if A_(«) < A+(0 then 
|_ tt/ <— I + 1; 

10 Ensure answer is not out of bounds; 

11 Of «— Min(Q'/, tt,,,); 



Under this case when p - p* both M"''' and Ml'^ 
are negative when a < a* (and positive when a > 

a'). 



5M_ 
da 



< and 



5M+ 
da 



5M_ 



da 



(12) 



Under this case when p - p* only is negative 
when a < a* (and positive when a > a*). 



3. 



— — < and — — 

oa oa 



dM. 



da 



(13) 



Under this case when p - p* only M" '' is negative 
when a < a* (and positive when a > a*). 

Furthermore we observe that it is only when a < a* 
that both M"''' and M"''' can be negative, and only when 
a > a* that both M'!''' and M"''' can be positive. 

The preceding statement is self-evident in case 1 and 
is demonstrated by figures |7a] and |7b] In case 2, when 
p - p* and a <a* , M"''^ is positive, however due to ( 12 1 
the positive magnitude of M" '' is dominated by the neg- 
ative magnitude of M"'''. Further, due to j^^j a; |^| 
(property [TJ and (|8]l we observe that a small increase 
in p will shift M" into a negative region before shift- 
ing M" '' into a positive region, hence ensuring that both 
M"''' and M"''^ are negative, this is demonstrated in fig- 
ure 7c A similar principle can be applied when a > a* 



and to case 3 (see figure 7d i 



Our a bounds can now be determined as follows: 

A lower bound on a, a.,, is determined by the highest 
index of Ui ( within a) such that A-(ai) < A+(a,) — 1. An 
upper bound on a, Uf, is determined by the lowest index 
of oi (within a) such that A_(ai) > A+(a;). 

To demonstrate the preceding statement, since Cj satis- 
fies A-(as) < A+(a.s) - I, within the rows M""*' and 
M""*'' there are corresponding elements (i.e. elements 
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pj pf 



Ps Ps 



(c)Case2:^<0,^>^ 



Ps Ps Pf Pf 

dU I flM+ I aM+ 



(d) Case 3: 



3a 



> 



da 



da 



<o 



Figure 7: The shaded area bounded by A_ represents the region where M_ is negative, the shaded area bounded by 
A+ represents the region where M+ is negative. Cases 1 to 3 represent equations ( 11 1 to ( 13 1. Depicted a values are 



assumed to be in the region of C. Bounds for p are given by the solid lines meeting the x axis and have been calculated 
by the zero line (ZL) method (algorithm[3jl. 



with the same a,p index) that are both negative. Con- 
sidering the corresponding elements M""'' and M""'^ are 
negative, due to ([8]l, any movement in p will result in at 
least one of M""'' and M""'^ becoming more negative, 
hence the solution C cannot be obtained when a - a,,. 
This effect is shown in figure 8a Recall, a, is deter- 



mined by the highest index of a,, this is because when a 
increases from being below a* to greater than a*, both 
M"''^ and M"''' go from being negative to positive, as 
was demonstrated earlier. 

Also, since or/ satisfies A_(q;/) > A+{af), within the 
rows M"-'^''^^ and M^!^'*' there are corresponding ele- 
ments that are both positive. Considering the corre- 
sponding elements M'!'''^ and M"'"'' are positive, due 
to dSll, any movement in p will result in at least one of 
M_ and M"^ '' becoming more positive, hence the so- 
lution C cannot be obtained when a - a/. This effect 



is shown in figure 8c Recall, a/ is determined by the 



lowest index of a, as was described in the previous para- 
graph. 

Rows represented by M'!'*'^ and M"''' such that ffj < 
a < aj may contain the solution C due to the presence 
of corresponding elements with opposite signs. Thus 
movements in p will result in decreasing absolute error 



in both corresponding elements, an example of which 
is shown in figure [8b] Example a bounds are shown in 
figure 9a A listing of this step is shown in algorithm 

m 



2.4.3. Step 3: Calculating bounds for p 

The third step brackets p* within lower and upper p 
bounds [p.5,p/], where Ps,Pf are p values within p. Two 
methods exist for this purpose, firstly, the zero line (ZL) 
method, and secondly, the relative value (RV) method 



that is based on an additional property, shown in ( 16 1, 
relating to the region very close to C. 

Bounds on p are calculated by assuming that the pre- 
viously calculated bounds on a capture the region of C 
(in the a domain). In practice this is achieved by en- 
suring the a domain within a is sufficiently discretised, 
whereby practical convergence examples are provided 
within sectionC 



The zero line (ZL) method (algorithm^ 

Given A(.) and the a bounds Qs^aj we calculate p in- 
dices p^,p^,pj,pj where: 
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(a) a = a 



N2 



(b) a where < a < a/ 



N3 
(c) a = a f 



= M 
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Figure 8: An illustration of a bounds (us, a/)- The y axis represents an error value shown in (j6|. (a) 



as where there is a node of p (Nl) such that both M'!"'^' and M"!!""^ have negative errors, where such negative errors 



shows the bound 



are observed to decrease for changes in p. (c) shows the bound af where there is a node of p (N3) such that both 



af N3 ar N3 

Mj and M_|_ have positive errors, where such positive errors are observed to increase for changes in p. (b) 
shows a value of a such that as < a < a/ where there is a node of p (N2) such that M":^^ and M"'^^ have errors of 
opposite signs, where such absolute errors are observed to decrease for changes p, consequently using this value of a 
it is thus possible for changes in p to result in the solution M"''^ = and Ml'^ - 0. 



Ps =A_(aj), 
p+ = A+iUs), 



p^. = A_(a/), 
pj = A+(af). 



(14) 
(15) 



Our p bounds can now be determined as follows based 
on possible signs of and (as listed within |TT| 
to ([T3|): 



Pj is a lower bound for p when > 0. p^ is a lower 
bound for p when < 0. Pj is an upper bound for 
p when > 0. p^ is an upper bound for p when 
^ <0.'" 

To demonstrate the preceding statement we first attempt 
to show that when '-^^ > the solution C (where 
M"j''' = 0) can only be obtained with values of p that 
are greater than a lower bound (having already estab- 
lish ed that a, is a lower bound on a). As shown in sec- 
tion |2.4.l| M'j"''' > and m!"'''^' < 0. Due to ([sjl there 
exists a value of p namely po such that M""''° = 0, where 



po > p, . Further, given > 0, an increase in as of 
Aa (towards a*) results in m'!'^'^"'''" > 0. Further due to 
([sjl, an increase in po of Ap results in '^^^^^"'P^^^p = q, 
where po + Ap > po > Pj . We have thus demonstrated 
that the solution C (where M" - 0) can only be ob- 
tained by values of p greater than a lower bound p7 . We 
note how this derivation relies on > being mono- 
tone as stated in property |2] The remaining bounds 
(Pj,p^,pp are demonstrated by reapplying the above 
method. 

A depiction of resultant bounds is shown in figure |7] 
For case 1, based on ( 1 1 1, lower and upper p bounds are 



given by p, and p^ respectively (see figures 7a and|7b[). 
This occurs when p^ < p^ and p^ > p^, where the lines 



A_ and A+ are shown to slope in opposite directions. 
For case 2, based on ( [T2] i, the lower p bound is given 
by rather than p^ (see figure 



7c I. This occurs when 



p~ > p^ where the lines A_ and~S+ slope down in the 
same leftward direction. For case 3, based on ( [T3| ), the 
upper p bound is given by p^ rather than p^ (see figure 
7di. This occurs when p^ < p^ where the lines A_ 
and A+ slope down in the same rightward direction. In 
instances when p^ - pJ or p^ - p^, all cases produce 
identical lower or upper bounds respectively. 



Example p bounds generated by the zero line (ZL) 

A listing 



method are shown in figures 9b 9q and 9d 
of this step is shown in algorithm|3] 



Algorithm 3: Calculate p bounds p, and pf based on 
the zero line (ZL) method 

1 p; a_(q',); 

2 Ps ^ A+(q'.,); 

3 pJ «- A_(q'/); 

4 p+ <- A+iOf)- 

5 Bounds as in figure [7c| 

6 if p7 > p^ tlien 



7 






8 






9 


Bounds as in figure [7d| 


10 


else if Pj < pj then 


11 




Pi «- p, ; 


12 




_ Pf'^Pp 


13 


Bounds as in figurespa 
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else 


15 




Pi «- Pi ; 
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Pf^PV- 
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(c) 




0.13 



Figure 9: (a) shows the lines A_ and A+ alongside a bounds generated by algorithm |2] [(b) (c) 



and 



(d) 



show error 



values within M_ and M+ calculated by (j6]l for changing values of p within lower and upper bounds of a. The faintest 
lines for M_ and M+ represent the lower a bound, ff ,, higher a values are represented by lines of increased thickness, 
whereby the thickest lines for M_ and M4 
examples of M_ and M4 



represent the upper a bound, aj. |(a)| and (b) correspond to identical 
(c) and (d) are based on the same data set as figure|4]but with different discretisation, where 



(c) shows an initial iteration enclosing a large area away from C and |(d)| shows a subsequent iteration enclosing an 
area close to C. 



The relative value (RV) method (algorithm^ 

Within this algorithm we make use of an additional 
property of our error matrices M_ and M+: 



Property 3. In the region very close to C: 



0, 



(■) 



5p2 



« 0. 



(16) 



Our previously described zero line (ZL) method (algo- 
rithm [3]l was based on the vectors A(.) which contain 
information relating to the position of positive and neg- 
ative elements within M(.) (through obtaining the line 
of intersection between the surfaces of M(.) and the 



zero plane). Further information can be obtained by 
evaluating differences between corresponding elements 
(i.e. elements with the same a,p index) within M_ and 
M4. 

Using our a bounds as,af, we define /j as the maxi- 
mum p index (within p) such that M""'' > M""'' (where 
due to lil M"-'''+' < M'f'"''+'^ 



) and define // as the 
maximum p index (within p) such that M'!'^'^' > M"^''^ 
(where due to (g M!'"'^^' < M""''^'). 

Our p bounds can now be determined as follows: 

In the region very close to C, the p values corresponding 
to Min(/s,/y) and Max(/s,/y) + 1 are lower and upper 
bounds on p respectively. 



10 



To demonstrate the preceding statement we initially 
identify that in the region of C, due to M""^ ' + 
M""* ' < and that M!"'' > M""''' . Consequently due 
to (|8]i there exists a value of p, indexed by Is + Ap, where 
Is < Is + Ap < /, + 1, such that: 

Similarly, there exists a value of p, namely // + Ap, 
where // < // + Ap < // + 1, such that: 

Also, at the solution C: 



Algorithm 4: Calculate p bounds p, and pf based on 
the relative value (RV) method 

1 Initialise values; 

2 Is Pu 

3 // <-p„; 

4 Loop from low to high p values; 

5 for i = pi to p„ do 

6 ifM'!"'>M^'-'then 

7 I i\ 

8 ifMj'-'>M7''then 

9 ^ L ^ 

10 p, «— Min(/„//^); 

11 Pf «— Max(/„/y) + 1; 

12 Ensure answer is not out of bounds; 

13 p; «-Min(p^,p„); 



(19) 



p. As such we can extrapolate the likely position of 



to 



The movement of error from M"j'^'^'^'' < 

M"j'^'^^'' > is based on changes to both a and p. 
Due to ([T6|, it follows that the above traversal from 



jyja /,+Ap < to M' 



afJf+Ap 



> is based on a fixed 



unidirectional rate of change with respect to a and p, 
whereby the traversal includes the point M"^''' . There- 
fore if Is + Ap < If + Ap it follows that + Ap < p* < 
// + Ap, which results in the p bounds [Is, If + !]• Al- 
ternately, if Is + Ap > If + Ap, p bounds are given as 
[//, /.s-hI]. Thus we set our lower bound p, as Min(/5,/y ) 
and upper bound p/ as Max(/j, //) -H 1. 

Example p bounds generated by the relative value (RV) 
method are shown in figures [9bl [9c| and [9d| A listing of 
this step is show in algorithrnffl 



2.4.4. Step 4: 
5) 



Local mesh refinement (algorithm 



This step is used optionally and involves estimating the 
solution C (i.e. a* ,p*) as a'^'"°'\ p''''°'\ The objective is 
to build a locally refined mesh that is fine near the esti- 
mated solution, and progressively coarse away from the 
estimated solution, thereby attempting to improve the 
rate of convergence to C. 



The location of a^'"'" 
pressed in 



16 1 



pTm.A based on the linearity ex- 
As mentioned in the discussion of the 
relative value (RV) method (algorithm |4|l, the traversal 
of error from ( [T7] l to ([T9| to ( fTSj ) is based on a con- 
stant unidirectional rate of change with respect to a and 



Q,T,irgci pTiigLi -pjjg method works as follows: 



Firstly, we obtain a vector / where each element /,■ 
corresponds to a value a,- within the range Os, - ■ ■ ,af 
and is populated with the maximum p index (within p) 
such that M"'''' > Mj"'' (where due to ^ M"'-''^^ < 
jyjff/Ji+i-) jjQjg jjj^j jjjg relative value (RV) method 

we obtained Is and If which are values of / correspond- 
ing to ff j and af respectively. 



Secondly, we define the following six vectors with in- 
dices / and which are of size equal to /: 



p,{i) = pdi), pi+iii) = p(/i + 1). 



(20) 



We further define the vectors p/ and e/ (note the ab- 
sence of a horizontal line) which are also of size equal 
to /, where p/ represents estimated values of p at which 
M"''' - M1'^ and e/ represents the corresponding esti- 
mated error values. Each estimate within the vectors p/ 
and 6/ is representative of an a value within c,, ■ ■ ■ , af. 
An example of the pairs (p/, ej) is shown in figure 10 
p/ and 6/ are obtained by using the intermediate vectors 
a-,a+, b-,b+ (also of size equal to /) as per the follow- 
ing element-by-element arithmetic: 
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Consequently we are able to build a two dimensional 
locally refined mesh that is centred at Q/Targci^^Targci ^j^^ 
bounded by (a^, a/) and (ps,pf) as shown in figures 6b 
toil 
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Figure 10: Depiction of the local mesh refinement step 
(algorithm 5) obtaining a guess for p^"-"' by estimat- 
ing pi, ej, where pj, ei are points that represent where 
M"''^ = M"'^. The intersection of the line connecting 
elements of p/, e/ with the x axis is taken as a guess for 
pTargci rpjjg y ^jj^jg represents an error value shown in (j6jl. 



el(0 - e^i) I 
O-(0 = — I — 77"> ^-(0 = f-(0 - a_(0 X p/(0, 



Table 1 : Performance of the parallel S ABR calibration 
method implemented on a GPU and a typical gradient 
descent algorithm implemented on a CPU for a prob- 
lem size of 192 calibrations, t - total execution time 
(ms). Iter = iterations. GPU iterations represent the av- 
erage number of successive grid iterations. CPU itera- 
tions represent the average combined number of gradi- 
ent updates on a and p. GPU results are based on the 
relative value (RV) method with local mesh refinement 
(V3). Speedup reported as CPU time / GPU time. Ac- 
curacy is calculated as the RHS of (j7]i. CPU results are 
based on a single threaded implementation. 



fl+(0 = 



Pi(i) - Pi+lii) 
ejd) - eld) 
Piii) - Pi+i(i) 



, b+(i) - eld) - a+{i) Xp,(i), 2.5. Performance results 



b+(i) - b-(i) 

Pid) = 7T —, = a_(0 xp/(0 + bji). 



a+{i) — fl_(/) 



Results are generated using the system properties shown 
Our parallel calibration method is im- 



Appendix A 



a+(i)xpi{i) + b+{i)- 
(21) 



Next, we obtain the index /* that corresponds to the min- 
imum error within e/, alongside the index /** that corre- 
sponds to the neighbour of /* that satisfies: 



sign(e/(/*)) sign(e/(/**)). 



(22) 



Finally, we are able to identify p^"'-™' (seen in figure 
10 1 and a^"'" as follows using the intermediate vari- 
ables ap,bp,aa,ba, where /i = Min(/*,/**) and 12 - 
Max(/*,/") = !i -H 1: 



eidi) - 61(12) 

an = , 

Piih) - Piih) 

^ _ eidi) - eiih) 
a(ii)-a(i2)' 



bp = £i{i\) - ap xpidi), 

ba = e/('i) -OaX a(ii), 
b 

Target _ "a 



(23) 



plemented on GPUs whereby each individual calibra- 
tion is conducted by a different GPU 'thread block' . A 
thread block is viewed as a collection of parallel threads 
where the programmer is able to make use of a shared 
memory space and set synchronisation points. The use 
of a separate GPU thread block for each individual cal- 
ibration is based on our chosen category of derivatives, 
which comprises of multiple coupons resulting in upto 
several hundred individual calibrations for pricing a sin- 
gle derivative. 

Within our implementation the size of each GPU thread 
block is chosen to equal the number of elements within 
M(.) shown in (|5]l, whereby each element within M(.) is 
calculated by a separate parallel thread. After the cre- 
ation of M(.) all threads are synchronised and a single 
thread undertakes the sequential steps 1 to 4 listed in 
section 2.4 We note that the creation of M(.) vastly 
dominates the execution time of steps 1 to 4, thus 
minimising the performance penalty of using a single 
thread. 

Our results depict the following changing parame- 
ters: 
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(d) Accuracy esABR = 10 and thread block size = 64 (up- 
per iteration bounds are shown) 



Figure 1 1 : Performance of the parallel SABR calibration method for a typical problem size of 192 calibrations (except 
^c)| . Iter = average iterations. Time = total kernel execution time (ms), No. - the number of calibrations conducted. 
Villi?) - configuration based on 1. zero line (ZL) method, 2. relative value (RV) method, 3. RV method with local 
mesh refinement. Block size - GPU thread block size. Accuracy le-x = csabr as calculated by the RHS of Q. 



Firstly, our GPU thread block size varies from 36 to 256 
threads. Thread blocks are arranged with x and y coor- 
dinates representing a grid of a and p values as shown in 
(|5]l. Thread blocks are arranged to have (as far as possi- 
ble) an equal number of threads within the x and y coor- 
dinates, for example a thread block with 64 threads rep- 
resents a two dimensional grid of 8 x 8 threads. 



tions (or equivalently the number of GPU thread blocks 
launched) from 2 to 192. Higher numbers of individ- 
ual calibrations did not change the algorithm dynamics 
(specifically the optimum thread block size as shown in 



figure 11c I 



Thirdly, the calibration accuracy level esABR calculated 
as the RHS of ^ varies from 10"^ to 10 '2. 



Secondly, we vary the number of individual calibra- 



Finally, we vary our parallel calibration method to op- 



13 



erate under different configurations relating to steps 3 
and 4 listed in section |2.4| In all cases however, the 
first iteration undertakes step 3 using the zero line (ZL) 
method and does not undertake step 4. This is because 
the relative value (RV) method of step 3 and local mesh 
refinement of step 4 rely on being very close to the re- 
gion of C which is assumed to be unavailable during the 
first iteration. For subsequent iterations the following 
configurations are used: 

VI represents continued usage of the zero line (ZL) 
method in step 3 (step 4 is not undertaken). 

V2 represents usage of the relative value (RV) method 
in step 3 (step 4 is not undertaken). 

V3 represents usage of the relative value (RV) method 
in step 3 and local mesh refinement in step 4. 

In terms of performance the zero line (ZL) method was 
always inferior to the relative value (RV) method, hence 
only the relative value (RV) method was used in con- 
junction with local mesh refinement. GPU thread blocks 
of size smaller than 36 did not converge for all of our 
test cases and have thus not been included within our 
analysis. Non-convergence was due to an insufficient 
initial discretisation of a which led to the region of C 
not being captured in the initial iteration. 

we show results when esABR — 10-'" 



11a 



In figure 

and observe that V3 using 64 threads has the best per- 
formance. Figure |1 lb| shows results from V3 and de- 
picts the effect on execution time and iterations needed 
when using different values of esABR- In figure 11c we 



vary the number of calibrations scheduled on the GPU 
from 2 to 192. We observe that when the number of 
calibrations scheduled is very low, the dominant factor 
in execution time is the number of required iterations 
whereby larger thread blocks reduce the number of re- 
quired iterations which consequently reduces execution 
time. As the number of calibrations scheduled increase 
however, the optimum execution time is a compromise 
between the number of required iterations and the num- 
ber of concurrent thread blocks scheduled within the 
GPU processor whereby smaller thread blocks result in 
more concurrent thread blocks scheduled on the GPU 
which consequently reduces execution time. The use 
of 64 threads within a thread block is shown to offer 



the best overall performance. Figure 1 Id shows results 
when esABR - 10"'" and provides an overview of the 
number of iterations needed for each calibration con- 
figuration where V3 is shown to have the best perfor- 
mance. 

Our GPU implementation is also compared briefly 



against a single threaded CPU implementation that per- 
forms SABR calibration based on a typical three dimen- 
sional gradient descent algorithm, results of which are 
shown in table [T] Results indicate that the use of the 
relative value (RV) method with local mesh refinement 
(V3) can outperform a single threaded CPU implemen- 
tation by a factor of 6 to 8 x. Additionally, our GPU 
implementation is shown to offer better performance for 
higher accuracies of esABR- This is attributed to a bet- 
ter final rate of convergence within our chosen V3 con- 
figuration, an example of which is shown in figure |6] 
As stated, within V3 an initial iteration uses the zero 
line (ZL) method, subsequently the relative value (RV) 
method with local mesh refinement is employed. Since 
local mesh refinement is designed for the region very 
close to C, several iterations are needed before a and p 
bounds are based on the finest subgrid (as seen in fig- 
6el and |6B, at this point subsequent convergence 



ures 



settles to a linear rate dependent on the domain size of 
the finest subgrid. The example in figure [6] suggests a 
final linear convergence rate of (9^10 



3. Cumulative probability lookup tables 

This section will present parallel algorithms to form 
lookup tables consisting of cumulative probability dis- 
tributions. A typical cumulative probability distribution 



is shown in figure 12a 



3.1. Background 

Given a random variable, namely a future asset price 
or value F{T), P{x) is chosen to represent a cumulative 
probability which is defined as the probability that F(T) 
will be assigned a value less than or equal to a particular 
asset value x such that: 



P{x) = Prob(F(r) < x). 



(24) 



Within our subsequent algorithms the future asset price 
F{T) is assumed to have a probability distribution im- 
plied by the SABR model, presented in section |2] how- 
ever our algorithms are applicable to all probability 
distributions that exhibit clustering effects as described 
later in this section. 

As part of our derivative pricing model we undertake 
a large Monte Carlo simulation, within which we are 
given cumulative probabiUties P(x) and are required to 
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Asset value x Asset value x 



(a) Example cumulative probability distribution 



(b) Example linear interpolation error with error caps f * 



Figure 12: (a) shows an example cumulative probability distribution, (b) shows the corresponding linear interpolation 
error based on ( 27 1. For the no error cap case the maximum interpolation error is centred around the greatest curvature 
within [i 



(a) We further observe that the maximum interpolation error is always clustered around a series of peaks. 



calculate the corresponding asset value x. That is we are 
required to calculate the inverse probability: 



(25) 



where P'"^ {?{■)) = (■)• Using the SABR model it is 
possible to calculate P( ) cheaply analytically, however 
calculating an inverse requires a computation- 

ally expensive iterative method as no analytic inverse 
exists. Within our stated Monte Carlo simulation we 
are required to conduct numerous unique evaluations of 
P'^{-) equivalent to the size of the Monte Carlo simula- 
tion s, where s is typically of (9(10^ to 10^). Since the 

p f"' 
relationship x P(x) (and conversely P(x) — > x) is 

one-to-one and monotone it is typical to reduce compu- 



tational effort by approximating ( 25 1 through a lookup 
table of size A^, where < s. Such a lookup table con- 
sists of an array x containing discrete ordered values 
of X and an array containing corresponding values 
of P{x). To calculate an inverse as shown in ( |25| ) we 
use the following linear interpolation approach: firstly, 
the closest entries to our input P(x) within the array 
are found, namely Px{a) and P_^ib) (e.g. through a 
binary search). Subsequently we interpolate between 
corresponding values within the array x, namely x{a) 
and x{b), such that p5] l is approximated as (where 
a < b): 



x(b) — x(d) 
X = x{a) + ^-^ — ^-^(Pix) - f ,(«)). 
PAb)-PAa) 



(26) 



3.2. Interpolation error 

Lookup table linear interpolation error ^ can be prac- 
tically estimated by the absolute difference between an 
interpolated mid-point value and an analytic mid-point 
value. Assuming the size of x and P^ is A^, an array con- 
sisting of such mid-point interpolation errors ^ will be 
of size N - I. Each element of this an^ay can be shown 
as: 



x(i) + xii+l)\ P,ii) + P ,(/ + 1 ) 



2 / 2 

where / e {I,--- ,A^- 1). (27) 



It is possible to cap the interpolation error calculated in 



( [27] i to a maximum value (as shown in figure 12b i. In 
doing so we refine the discretisation of x within the ar- 
ray X (and correspondingly the array P^) when the inter- 



polation error calculated in ( 27 1 is greater than (*, this is 



shown in figure 13 Consequently the lookup table size 



A^ may vary for different values of (*. It is further noted 
(as shown within figure |12b[ ) that maximum interpola- 
tion error is clustered around a series of peaks, critically. 
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our subsequent algorithms are optimised for probability 
distributions that exhibit this phenomenon. 



Step 1: Build two arrays jc, Pj (called A) of size N = 12: 



1.2e-05 
8.0e-06 
4.0e-06 
O.Oe+00 



Asset value x 



(a) No erTor cap ^* , uniform discretisation of x 



2.5e-06 
2.0e-06 
1.5e-06 
l.Oe-06 
5.0e-07 
O.Oe+00 



Asset value x 
(b) EiTor cap f * = 2.5e-06, non-uniform discretisation of x 

Figure 13: I. error = interpolation error based on «2J\ 



Plots are based on the same data set as figure 12b In |(b)| 
refining the discretisation of x results in lower interpo- 
lation error (. 



3.3. Algorithm descriptions 

Within this section we present several algorithms used 
to create cumulative probability lookup tables. We as- 
sume that we have a minimum grid size or lookup table 
size and that we are provided with lower and upper 
values for the array x, namely Xs and Xf respectively. 
Algorithms CPUl and CPU2 are single threaded CPU 
algorithms used to benchmark results. Algorithms FS, 
DSl, DS2 and DS3 are GPU algorithms constructed 
such that each lookup table (consisting of x and P^) is 
created within a separate GPU thread block. 

3.3.1. Naive CPU algorithm ( CPUl ) and optimised for 
clustering CPU algorithm (CPU2) 

Starting from a single initial pair i(l) = x^ and Pj^(l) - 
P(xs), new pairs are added based on a fixed increment 
of X, namely Ax, where: 



Ax - 



N-l 



(28) 



Each additional i-th pair within x and P^ is calculated 
as: 



x(i + I) - x(i) + Ax, 
PAi+l) =P(x(i) + Ax). 



(29) 
(30) 



Step 2: Build array ( of size N ~ \, locate contiguous en'or zones: 

^12 3 4 5 6 7 



12 3 4 5 6 7 8 910 11 

zone 1 where ^, > zone 2 where ^, > ^* 

Step 3: Memory shifting for each of the contiguous error zones: 

A |l|2|3|4|5|6|7|8|9|lo| | | | |ll|l2| 

First shift 



A I 1 I 2 I 3 I 4 

^ 1 2 3 4 



5 6 7 



Second shift 

Step 4: Update vacant values within the refined error zones: 

A I 1 I 2 I 3 I 4 I I I 5 I 6 I 7 I 8 I al I a2 I ;i3 I u4 I 11 I : 
^ 1 2 3 4 5 6 7 bl b2 b3 b4 b5 11 



New values in zone 2 



A 1 2 cl c2 c3 c4 5 6 7 



al a2 a3 a4 11 12 



i I 1 I dl 



d2 d3 d4 d5 5 6 7 bl b2 b3 b4 b5 11 



New values in zone 1 



Figure 14: An illustration of algorithm DS3 creating 
a cumulative probability lookup table where the initial 
grid size = 12. The shifting of points (step 3) and the 
update of vacant values (step 4) is conducted sequen- 
tially for each contiguous error zone found. Within each 
error zone however the shifting of points (step 3) and the 
update of vacant values (step 4) is undertaken by multi- 
ple parallel threads. 



After each new pair is added, ^, shown in ( 27 1 is evalu- 
ated. If it is found that ^, is greater than our chosen in- 
terpolation error cap (*, the points x{i + 1) and Pj.(/ + 1) 
are discarded and replaced with a pair of points based 
on half the previous increment of x, that is: 



x(i + I) - x(i) + 



Ax 



PJi+ 1) = P x(;) + 



Ax 



(31) 
(32) 



The process of discarding and replacing pairs (x(i + I ) 
and P^(i +1)) and halving the increment Ax is contin- 
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ued until < Subsequently algorithm CPUl reverts 
to adding new pairs based on the original increment of 
Ax shown in ( [28] l whereas algorithm CPU2 adds new 
pairs based on doubling the previously experienced in- 
crement of X (with the maximum increment being the 
original increment of Ax shown in (28 i). Algorithm 



(i+i. These updated values are calculated based on halv- 



CPU2 is thus better able to capture clustering effects (as 
shown in figure |12b| i and consequently has better per- 
formance as demonstrated in the results within section 
3.4 Additional pairs in both algorithms continue to be 
added until x(i) > Xf. 

3.3.2. GPU fixed size algorithm (FS) 

This GPU algorithm constructs the arrays x and of 
size based on a fixed increment of Ax shown in ( |28| ). 
Each element within these arrays is calculated by a sep- 
arate parallel thread. The algorithm operates in a fixed 
grid size A^ and does not attempt to cap maximum inter- 
polation error ^* . 

5.5.5. GPU dynamic size algorithms (DS) 

These GPU algorithms attempt to cap maximum inter- 
polation error and we consider three possible meth- 
ods of implementation, namely algorithms DSl, DS2 
and DS3. 

The algorithms initially construct the arrays x and Pj, of 
size A^ as shown in algorithm FS. Next an error array 
( of size A^ - 1 is constructed as shown in (27 1, where 



each element within ^ is created by a separate parallel 
thread. 



Algorithms DSl and DS2 

The algorithms use a single thread to loop through the 
elements where / e { 1 , ■ ■ ■ , A^ - 1 ) and are halted at / 
when ^, > 

When halted at /, values within x and P^ corresponding 
to elements {i + 1, ■ ■ ■ , A^) are transferred or shifted to 
elements {i + 2,- ■ ■ ,N+ 1) and values within ^ corre- 
sponding to elements {i + 2,- ■ ■ ,N- 1) are transferred 
to elements {/ -i- 3, ■ ■ ■ , A^). As a result the elements 
x(i + 1), P^{i + 1), and ^j+i are considered vacant. Al- 
gorithm DS 1 undertakes this shift using a single thread 
whereas algorithm DS2 undertakes this shift using mul- 
tiple threads in parallel. 

Additionally, when halted at / (and after the previously 
stated memory transfer or shift) a single thread updates 
values for the vacant elements x(i + I), P^(i + 1), and 



ing the original increment of x. Ax, as shown in (31 
and ( [32| . The algorithms do not increment the halted 
counter / until < in doing so it is possible that mul- 
tiple memory shifts and updates on x(i 4-1), P_x{i + 1), 
and (i+i are needed. For every such memory shift and 
update, the increment Ax is halved and the array size A^ 
is incremented by one. 

Algorithm DS3 

The algorithm uses a single thread to loop through the 
elements where / e {1, ■ ■ ■ , A^ - 1). During this loop 
the algorithm locates contiguous regions or error zones 
where {(t, ■ ■ ■ ,^i+„} > C ■ For each located error zone, 
multiple parallel threads shift elements within x, P^ and 
X in order to vacate a space that doubles the number of 
elements situated within an error zone. Therefore, if an 
error zone consisted of n points, the shift will result in 
the same error zone now occupying 2n points, thereby 
doubling the error zone's grid refinement. Within this 
vacant region (of 2n points) values within x, P^ and X 
are updated by multiple parallel threads. The algorithm 
continues this process of locating and refining contigu- 
ous regions or error zones until all elements within the 
error array X have interpolation error Xi < f*. A visual 
depiction of this algorithm is given in figure [T4] 

Critically in terms of performance, the use of contigu- 
ous regions or error zones is advantageous due to the 
clustering of points with higher interpolation error (as 



shown in figure 12b i, whereby clustering reduces the 
number of zones (which are calculated sequentially) and 
increases the number of contiguous points in a zone 
(which are calculated in parallel). 

3.4. Performance results 

Results are generated using the system properties shown 
in Appendix A Our results consider the construction 



of 192 individual lookup tables, this represents a typi- 
cal problem size (results for smaller problem sizes of- 
fered similar performance characteristics as shown in 
section |4|. As stated, within our GPU implementation 
each lookup array (consisting of x and Px) is created 
within a separate GPU thread block. 

In table|2]we list algorithm execution times and average 
lookup table grid sizes A^ (where the initial grid size is 
preset to A^ = 500) for various interpolation error cap 
values Algorithm DS3 strongly outperforms algo- 
rithms DS 1 and DS2 due to the aforementioned cluster- 
ing effect captured by algorithm DS3. Algorithm CPU2 
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Error cap f * 


None 


5e-4 


le-4 


5e-5 


le-5 


5e-6 


le-6 


5e-7 


le-7 


Grid size N 


500 


503.0 


543.1 


585.0 


785.9 


959.3 


1,674 


2,232 


4,573 


DSl f (ms) 


2.8 


30.8 


249.5 


507.6 


1,633.8 


2,648 


N/A 


N/A 


N/A 


DS2 t (ms) 


3.2 


4.6 


14.7 


26.4 


79.5 


127.6 


N/A 


N/A 


N/A 


DS3 t (ms) 


2.6 


4.3 


8.2 


10.6 


16.1 


20.0 


31.6 


41.8 


80.9 


CPUl t (ms) 


114.2 


118.5 


134.3 


156.4 


278.0 


406.2 


950.5 


1,482 


3,697 


CPU2 t (ms) 


115.5 


115.5 


128.6 


143.1 


216.4 


272.2 


504.8 


697.2 


1,469 


FS t (ms) 


0.7 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


Speedup 


44.4 


26.8 


15.7 


13.5 


13.4 


13.6 


16.0 


16.7 


18.2 


DS3 Tpp 


0.005 


0.009 


0.015 


0.018 


0.020 


0.021 


0.019 


0.019 


0.018 


CPU2 Tpp 


0.231 


0.230 


0.237 


0.245 


0.275 


0.284 


0.302 


0.312 


0.321 



Table 2: Algorithm performance for the generation of 192 cumulative probability lookup tables. Grid size - 
average grid size where the initial grid size is preset to N = 500. t - execution time (ms). DS 1/2/3 = GPU dynamic 
size algorithms, FS - GPU fixed size algorithm with no error cap, CPUl/2 = CPU algorithms. DS1/DS2 show results 
limited to an error cap ^* = 5e-6. Speedup reported as CPU2 time / DS3 time. Tpp represents the time taken per point 
and is calculated as Time (ms) / Grid size (A^). CPU results are based on a single threaded implementation. 
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(a) Resultant lookup table grid size N for various enw caps 
Ail algorithms depicted near identical resultant grid 
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(b) Algorithm execution times for various error caps f * . DS3 
= GPU dynamic size 3 algorithm. CPUl/2 = CPU algo- 
rithms. Grid size = average resultant grid size N. 



Figure 15: Performance of algorithms generating cumulative probability lookup tables for various interpolation error 
caps 



also outperforms algorithm CPUl due to the same rea- 
son. 



In figure 15a we compare the resultant lookup table 
grid size for various error caps Within figure 



15b we observe in the higher accuracy region where 
^* = 5 X 10"^ to 1 X 10"^ that algorithm CPUl markedly 
deteriorates as accuracy increases, with calculation time 
of OiN"^), where c > 1 . This was due to the extremely 
heavy discarding of points as is intrinsic within algo- 



rithm CPUl. In contrast (within the same higher ac- 
curacy region) the performance of algorithms DS3 and 
CPU2 is based on a superior stable calculation time of 
0(N). Such stability is evidenced by the consistent Tpp 
(time taken per point) ratio shown in table |2] and is at- 
tributed to the described clustering effect captured by al- 
gorithms DS3 and CPU2. As accuracy increases within 
the stated higher accuracy region, algorithm DS3 main- 
tains a very stable Tpp ratio whereas algorithm CPU2 
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is shown to experience a small increase in Tpp ratios. 
This results in a small increase in comparative CPU2 / 
DS3 speedups as shown in table |2] (the depiction of a 
more efficient CPU implementation is outside the scope 
of this paper). 



Of further interest within figure 15b is the region where 
we go from no error cap to the smallest error cap of f * = 
5 X 10"^ which results in a small increase in the average 
lookup table grid size (from - 500 to = 503 
as shown in table [2|i. Our CPU implementations have 
a very low time penalty for the increased grid size due 
to minimal discarding alongside the very few number of 
additional points needed within A^. In contrast algorithm 
DS3 approximately doubled in execution time due to 
the additional points being calculated sequentially from 
the original set of points (despite such points being very 
few and despite being themselves calculated in paral- 
lel). 





Coupons 


96 


48 


96 


48 




Valuations 


192 


96 


192 


96 


CPU 


Step 1 (ms) 


20.8 


11.3 


42.8 


20.0 


stages 


Step 2 (ms) 


143.1 


73.0 


2072 


1033 




Total (ms) 


163.9 


84.4 


2115 


1053 




error 


3e-7 


4e-7 


3e-9 


3e-9 


GPU 


Step 1 (ms) 


5.9 


3.3 


7.9 


4.2 


stages 


Step 2 (ms) 


10.6 


5.3 


109.5 


58.6 




Total (ms) 


16.5 


8.5 


117.4 


62.8 




Rc error 


2e-7 


3e-7 


2e-9 


2e-9 


Spee- 


Step 1 (X) 


3.54 


3.46 


5.41 


4.78 


dups 


Step 2 (x) 


13.53 


13.89 


18.93 


17.63 




Total (X) 


9.96 


9.89 


18.02 


16.78 



Table 3: Overall model performance results. Valuations 
refer to the number of SABR calibrations (step 1) or 
the number of created probability lookup tables (step 
2). Speedups reported as CPU time / GPU time. Rc 
error is based on a benchmark value of Rc calculated in 
section |4T| CPU results are based on a single threaded 
implementation. 



4. Overall empirical error and performance 



composed of n coupons, where the price of such deriva- 
tives is equal to Yj'Li P^c^ where PV^ is the individ- 
ual and independent price of the i-th coupon calculated 
as: 



PVr = X Not X 5 X DF, 



(33) 



where Rc is the pricing model calculated coupon rate, 
Not is the fixed notional amount of the derivative con- 
tract reported in units of a currency (e.g. USD), 6 is the 
time interval upon which Rc acts and DF is a discount 
factor applied. The magnitude of Rc x Not x 6 x DF is 
typically of 0{lQ^ to 10"^), therefore in order for PV^. to 
be accurate to within 1 unit of a currency (e.g. 1 USD 
which is a typical choice) the model calculated coupon 
rate Rc must be accurate to 0{lQ^^ to 10"''). 

The coupon rate Rc is calculated through the follow- 
ing sequential steps: 1. SABR calibration, 2. cumula- 
tive probability lookup table construction and 3. Monte 
Carlo simulation. Within our error and performance 
analysis we focus on steps 1 and 2 only; error and per- 
formance analysis of the Monte Carlo simulation in step 
3 is outside the scope of this paper. As described in sec- 
tions [2] and [3] GPU based calculations relating to each 
SABR calibration (step 1) and each probability lookup 
table (step 2) are undertaken on separate GPU thread 
blocks. 



4.7. Error test ] 

Within our first test we fix the accuracy of step 1 at ma- 
chine precision, such that esabr calculated as the RHS 
of^ is of (9(10-"'). Next we calculate a benchmark 
value of Rc with step 2 interpolation error shown in ( 27 1 
capped at our lowest feasible level of ^* - 1 x 10^''^ 
(this value resulted in a lookup table size A^ of 0(1Q^)). 
Next we calculate a number of test values of Rc with 
higher interpolation error caps of ^* > 1 x 10 The 
differences between such test values and the benchmark 
value are reported as errors of Rc- Figure 16a depicts 



error cap values needed to calculate Rc to our target 
accuracy of 0(10 '^ to lO '*). 



This section presents an analysis of empirical error 
propagation and performance results for our derivative 
pricing model which incorporates both SABR calibra- 
tion (as described in section[2]l and the creation of cumu- 
lative probability lookup tables (as described in section 
[3]l. Our chosen category of derivatives to be priced is 



4.2. Error test 2 

Within our second test we fix the accuracy of step 2 at 
^* = 5 X 10"^. Using the benchmark value of Rc cal- 
culated in our first error test, we depict magnitudes of 
esABR needed to calculate Rc to our target accuracy of 
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Test Number 

(a) Various interpolation error caps f*. FS = no error cap. 
The accuracy level esABR is fixed at (9(10"'*). 
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(b) Various SABR accuracy levels esABR- The interpolation 
error cap is fixed at 5e-8. 



Figure 16: Error in coupon rate Rc for various SABR accuracies esABR calculated as the RHS of (j?]) and for various 
interpolation error caps ^* based on pT] ). error is based on a benchmark value of Rc calculated in section [4~T] The 
required accuracy of Rc depends on parameters within (33 i and is typically of (9(10^* to 10"^). Upper error bounds 
are shown. 



(9(10^^ to lO"*^). Our choice ^* = 5 x 10"** is motivated 
by the fact that this value enables Rc to be calculated 
to our highest target accuracy of C)(10"^). Since w e fix 
^* = 5 X 10"**, the best accuracy shown in figure 16b 
is limited by the bound of ^* = 5 x 10"** within figure 

[Tea 



43. Timing results 

Within our timing results we utilise two test sets. The 
first test set uses paramete rs {cs abr - 1 x 10"^, - 
5 X 10""''), based on figures 16a and 16b both these pa- 



rameters result in a predicted Rc error of (9(10"^). The 
second test set uses parame ters { 6sa br - 1 x 10"'°, ^* = 
5 X 10"^), based on figures 16a and 16b both these pa- 



rameters result in a predicted Rc error of (9(10 ^). Pa- 
rameters are chosen to reflect the typical desired accu- 
racy ofRc which is of 0(10"'' to 10"^). 

The two test sets are applied to example derivative con- 
tracts having 48 or 96 coupons. Our chosen derivative 
pricing model is used to value spread derivatives and as 
a result, for each coupon present, two sets of SABR cal- 
ibrations and cumulative probability lookup tables are 
needed. Results are presented in table [3] where Rc er- 
rors are shown to match our stated predictions. The total 
performance speedup (single threaded CPU time / GPU 



time) is from 10 to 20 x. Reiterating our earlier per- 
formance analysis (in sections |2] and |3]l we note that the 
GPU implementation offers superior comparative per- 
formance with more accurate error bounds. 



5. Conclusions 

This paper has presented novel implementations of it- 
erative algorithms common within derivative pricing 
models, namely SABR calibration and the creation of 
cumulative probability lookup tables. The proposed 
SABR calibration method is well suited to parallel ar- 
chitectures such as GPUs where use of the relative value 
(RV) method with local mesh refinement offered fast 
convergence. The SABR calibration method was also 
shown to be correct for SABR models that are consistent 
with a number of stated properties. Further work may 
consider a direct mathematical analysis of the under- 
lying SABR equations to determine correctness of the 
stated properties and may also consider instances where 
calibration is based on more than three market instru- 
ments. The creation of cumulative probability lookup 
tables through the proposed dynamic size 3 algorithm 
(DS3) is also well suited to parallel architectures due to 
the use of parallelism to capture clustering effects that 
arise within certain probability distributions. 
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Appendix A. System Properties 

Computational results are based on the following: CPU 
results are based on an Intel Xeon L5640 processor with 
results reported from a single thread process using C-i-i- 
code compiled using the Microsoft Visual Studio 2010 
compiler. GPU results are based on an NVIDIA Tesla 
M2070 (Fermi generation) processor using CUDA C 
code compiled under the NVIDIA CUDA 4.2 runtime 

API. Both CPU and GPU implementations are based on 
calculations using double precision floating point vari- 
ables. 
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